39 research outputs found

Word Embedding Based on Large-Scale Web Corpora as a Powerful Lexicographic Tool (Vektorski prikaz riječi utemeljen na velikim mrežnim korpusima kao moćan leksikografski alat)

Author: Radovan Garabík
Publication venue: Institute of Croatian Language and Linguistics
Publication date: 01/01/2020

The Aranea Project offers a set of comparable corpora for two dozen (mostly European) languages, providing a convenient dataset for NLP applications that require training on large amounts of data. The article presents word embedding models trained on the Aranea corpora and an online interface for querying the models and visualizing the results. The implementation is aimed at lexicographic use but can also be useful in other fields of linguistic study, since the vector space is a plausible model of the semantic space of word meanings. Three different models are available: one for a combination of part of speech and lemma, one for raw word forms, and one based on the fastText algorithm, which uses subword vectors and is not limited to whole or known words when finding semantic relations. The article describes the interface and the major modes of its functionality; it does not attempt a detailed linguistic analysis of the presented examples.
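As a rough illustration of what querying a word embedding model involves, the sketch below ranks words by cosine similarity to a query word. It is a minimal sketch, not the Aranea interface: the toy vectors, the `VECTORS` table, and the function names `cosine` and `nearest` are all invented for this example.

```python
import math

# Toy 3-dimensional vectors, invented for illustration only.
# Real embedding models use hundreds of dimensions learned from a corpus.
VECTORS = {
    "dog": [0.9, 0.1, 0.0],
    "cat": [0.8, 0.2, 0.1],
    "car": [0.1, 0.9, 0.2],
    "truck": [0.0, 0.8, 0.3],
}


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(word, vectors, k=2):
    """Return the k words most similar to `word`, best match first."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    for neighbour, score in nearest("dog", VECTORS):
        print(f"{neighbour}\t{score:.3f}")
```

With these toy vectors, the closest neighbour of "dog" is "cat" and the closest neighbour of "car" is "truck". A model keyed on part of speech plus lemma would use the same lookup with different dictionary keys.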

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

Bilingual Corpus - Digital Repository for Preservation of Language Heritage

Authors: Ludmila Dimitrova, Radovan Garabík
Publication venue: Institute of Mathematics and Informatics Bulgarian Academy of Sciences
Publication date: 01/01/2012

The article briefly reviews a bilingual Slovak-Bulgarian/Bulgarian-Slovak parallel and aligned corpus. The corpus was collected and developed as a result of collaboration within a joint research project between the Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, and the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences. Multilingual corpora are large repositories of language data with an important role in preserving and supporting the world's cultural heritage, because natural language is an outstanding part of human cultural values and collective memory, and a bridge between cultures. This bilingual corpus will be widely applicable to contrastive studies of the two Slavic languages and will also be a useful resource for language engineering research and development, especially in machine translation.

Bulgarian Digital Mathematics Library at IMI-BAS

Naive Terminological Annotation of Legal Texts in Slovak – Can It Be Useful? (Naivno terminološko označivanje zakonskih tekstova u slovačkom – može li biti korisno?)

Authors: Radovan Garabík, Jana Levická
Publication venue: Institute of Croatian Language and Linguistics
Publication date: 01/01/2022

Correct automatic terminological annotation of texts in a corpus can sometimes be a challenging task, especially for moderately or heavily inflected languages with relatively free word order. We explore the possibility of simple annotation based on sequence matching of lemmatized texts to annotate a Slovak-language corpus with IATE terminological entries. The accuracy of annotating legal language is very good for multiword terms, while the accuracy for single-word terms can be increased by applying simple filters based on word length and by blacklisting the most frequent false positives.
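The approach summarized in this abstract can be sketched as a longest-match lookup of term lemma sequences in a lemmatized text, with a length filter and a blacklist applied to single-word matches only. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `annotate_terms`, its parameters, the threshold of four characters, and the English sample lemmas are all hypothetical, and the sample terms are not taken from IATE.

```python
def annotate_terms(lemmas, terms, min_single_len=4, blacklist=frozenset()):
    """Find term occurrences in a lemmatized text by exact sequence matching.

    lemmas: list of lemmas of the text, in running order.
    terms: iterable of lemma sequences, one per terminological entry.
    Returns a list of (start, end, term) spans with `end` exclusive.
    Single-word matches are dropped when shorter than `min_single_len`
    or when listed in `blacklist` (assumed frequent false positives).
    """
    term_set = {tuple(term) for term in terms}
    max_len = max((len(term) for term in term_set), default=0)
    spans = []
    i = 0
    while i < len(lemmas):
        matched = False
        # Try the longest candidate first so multiword terms win.
        for n in range(min(max_len, len(lemmas) - i), 0, -1):
            candidate = tuple(lemmas[i:i + n])
            if candidate not in term_set:
                continue
            if n == 1:
                word = candidate[0]
                if len(word) < min_single_len or word in blacklist:
                    continue
            spans.append((i, i + n, candidate))
            i += n
            matched = True
            break
        if not matched:
            i += 1
    return spans


if __name__ == "__main__":
    # Invented sample data, not real IATE entries.
    text = ["the", "european", "union", "adopt", "a",
            "regulation", "on", "data", "protection"]
    entries = [("european", "union"), ("data", "protection"),
               ("regulation",), ("on",), ("data",), ("adopt",)]
    for span in annotate_terms(text, entries, blacklist={"adopt"}):
        print(span)
```

In this toy run the two multiword terms and "regulation" are annotated, "on" is rejected by the length filter, and "adopt" is rejected by the blacklist.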

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

Accuracy of Slovak Language Lemmatization and MSD Tagging – MorphoDiTa and SpaCy

Authors: Radovan Garabík, Denis Mitana
Publication date: 01/01/2022

The Slovak language, as a “typical” Slavic language, belongs to the group of moderately inflected languages, with three or four genders and two grammatical numbers, all interacting with the inflections in somewhat complicated and unpredictable ways. The inflections are realized primarily by suffixes, but with many irregularities; one suffix encodes several relevant grammatical categories, and the same suffix often reflects unrelated features in other words, which makes Slovak a typical inflectional language not amenable to heuristic analysis. Because of these limitations, lemmatization is often an indispensable step in all kinds of text processing (starting with full-text search), and full morphosyntactic analysis or description (MSD) is the core of corpus linguistic research. Given the core importance of lemmatization and MSD in Slovak corpus linguistics, it is important to understand their limitations and the accuracy that can be achieved. Since modern approaches aim to utilize deep learning and huge language models, we evaluate the accuracy of lemmatization + MSD in several common usage scenarios by comparing the state-of-the-art “classical” lemmatizer and MSD tagger MorphoDiTa, based on a perceptron, with spaCy, using a multilingual BERT language model.

Mykolas Romeris University Institutional Repository

Translation equivalence of demonstrative pronouns in Bulgarian-Slovak parallel texts

Authors: Ludmila Dimitrova, Radovan Garabík
Publication venue: Institute of Slavic Studies Polish Academy of Sciences
Publication date: 01/01/2014

In this paper we describe our automatic analysis of several parallel Bulgarian-Slovak texts with the goal of obtaining useful information about Slovak translation equivalents of (definite) articles and demonstrative pronouns in Bulgarian. Rather than focusing on individual translation equivalents, we present a method for automatic extraction and visualization of the translations. This can serve as a guide for pinpointing interesting features in specific translated documents and could be extended to other parts of speech or otherwise identifiable textual units.

Crossref

Biblioteka Nauki - repozytorium artykułów

Directory of Open Access Journals

Extraction and Presentation of Bilingual Correspondences from Slovak-Bulgarian Parallel Corpus

Authors: Ludmila Dimitrova, Radovan Garabík
Publication venue: Institute of Slavic Studies Polish Academy of Sciences
Publication date: 01/01/2015

This paper describes the results of the automatic extraction and presentation of bilingual correspondences from the Slovak-Bulgarian parallel corpus. The equivalent phrases are extracted from a corpus automatically aligned at the sentence and word levels, then filtered, indexed and presented in a dictionary-like interface. The bilingual dictionary database contains 80 thousand phrase pairs consisting of approximately 350 thousand words (per language). Counting unique word forms, the size is 31 thousand in the Slovak part of the dictionary and 26 thousand in the Bulgarian part.

Crossref

Biblioteka Nauki - repozytorium artykułów

Directory of Open Access Journals

Web presentation of bilingual corpora (Slovak-Bulgarian and Bulgarian-Polish)

Authors: Ludmila Dimitrova, Radovan Garabík, Violetta Koseska-Toszewa
Publication venue: Institute of Slavic Studies Polish Academy of Sciences
Publication date: 01/11/2015

In this paper we focus on the web presentation of bilingual corpora in three Slavic languages and their possible applications. The Slovak-Bulgarian and Bulgarian-Polish corpora were collected and developed as a result of collaboration within two joint research projects between the Institute of Mathematics and Informatics, Bulgarian Academy of Sciences, on one side, and the Ľ. Štúr Institute of Linguistics, Slovak Academy of Sciences, and the Institute of Slavic Studies, Polish Academy of Sciences, on the other, coordinated by the authors of this paper.

Directory of Open Access Journals

Main results of MONDILEX project

Authors: Ludmila Dimitrova, Tomaž Erjavec, Radovan Garabík, Leonid Iomdin, Violetta Koseska-Toszewa, Volodymyr Shyrokov
Publication venue: Institute of Slavic Studies Polish Academy of Sciences
Publication date: 01/11/2015

The paper presents the results and recommendations of MONDILEX, a 7FP project that covered six Slavic languages: Bulgarian, Polish, Russian, Slovak, Slovene, and Ukrainian. The paper summarizes the research undertaken on standardisation and integration of Slavic language resources and on the establishment of a virtual organisation supporting research infrastructure for Slavic lexicography. The results should be useful for an implementation of a research infrastructure in the coming years.

Directory of Open Access Journals

The strategic impact of META-NET on the regional, national and international level

This article provides an overview of the dissemination work carried out in META-NET from 2010 until 2015; we describe its impact on the regional, national and international level, mainly with regard to politics and the funding situation for LT topics. The article documents the initiative's work throughout Europe in order to boost progress and innovation in our field. Peer reviewed.

Institutional Repository Universiteit Antwerpen

The University of Manchester - Institutional Repository

Helsingin yliopiston digitaalinen arkisto

Utrecht University Repository

The Strategic Impact of META-NET on the Regional, National and International Level

Authors: Georg Rehm, Hans Uszkoreit, Sophia Ananiadou, Núria Bel, Audrone Bieleviciene, Lars Borin, António Branco, Gerhard Budin, Nicoletta Calzolari, Walter Daelemans, Radovan Garabík, Marko Grobelnik, Carmen Garcia-Mateo, Josef Van Genabith, Jan Hajic, Inma Hernaez, John Judge, Svetla Koeva, Simon Krek, Cvetana Krstev, Krister Lindén, Bernardo Magnini, Joseph Mariani, John Mcnaught, Maite Melero, Monica Monachini, Asuncion Moreno, Jan Odijk, Maciej Ogrodniczuk, Piotr Pezik, Stelios Piperidis, Adam Przepiórkowski, Eiríkur Rögnvaldsson, Michael Rosner, Bolette Sandford Pedersen, Inguna Skadina, Koenraad De Smedt, Marko Tadić, Paul Thompson, Dan Tufiș, Tamás Váradi, Andrejs Vasiljevs, Kadri Vider, Jolanta Zabarskaite
Publication venue: European Language Resources Association (ELRA)
Publication date: 26/05/2014

This article provides an overview of the dissemination work carried out in META-NET from 2010 until early 2014; we describe its impact on the regional, national and international level, mainly with regard to politics and the situation of funding for LT topics. This paper documents the initiative’s work throughout Europe in order to boost progress and innovation in our field. Peer reviewed.

Helsingin yliopiston digitaalinen arkisto